Introduction

The TSL sequence service is a raw sequence data management service, available at <sequences.tsl.ac.uk>. The link can be accessed only from the NBI network, either directly or by connecting over VPN. The purpose of the TSL sequence service is to store raw data securely and in a well-organised way through Projects, Samples and Runs, capturing relevant metadata, so that the data structures mirror those needed for easier ENA submission.

How are data organised?

The top-level container is the 'project'. It holds the metadata about the study and is made up of many 'samples' (which are equivalent to different experimental conditions). Each 'sample' in turn holds many 'runs' (which are roughly equivalent to runs of a sequencing machine).

Raw datasets are provided by your sequencing provider and need to be downloaded. The raw data files can be very large. You can store them temporarily at /tsl/data/dropbox/, and you may also keep a copy on an external drive.

Best way to upload data

The best way to upload raw reads data is to move whole folders, where each folder contains all of the files, and only the files, that are raw reads files (and associated MD5 sum files) for a given run submission.
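A quick way to check that a folder meets this rule is sketched below. It assumes that raw reads end in .fastq.gz or .fq.gz and that checksum files end in .md5, which may not match your provider's naming, so adjust the patterns as needed. The function name list_unexpected_files is made up for illustration.

```shell
# Print every file in a run folder that is neither a raw reads file
# nor an MD5 file. No output means the folder holds only expected files.
# Assumption: reads end in .fastq.gz or .fq.gz, checksum files in .md5.
list_unexpected_files() {
    find "$1" -type f \
        ! -name '*.fastq.gz' ! -name '*.fq.gz' ! -name '*.md5'
}

# Example (placeholder path):
# list_unexpected_files /path/to/run_folder
```

Anything the function prints should be moved out of the folder before the run is submitted.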

Otherwise, if you have downloaded the data to the HPC dropbox area, the best way to upload it is through a browser in the remote desktop service (RDS). Because the dropbox folder is mounted in RDS, uploading very large files is much faster. You can also use a browser on your own laptop, but uploading large files (stored in the HPC dropbox or on your external drive) can be slow, especially when you are using the eduroam wifi network instead of a network cable at your desk.

Here, I will show you how to upload your data to the TSL sequence service using RDS.

Uploading data

Follow the steps below to connect to RDS:

  1. Open the Chrome, Firefox or Microsoft Edge browser (old or alternative browsers may not work).

  2. Go to the remote desktop service address https://winrds.nbi.ac.uk/RDWeb/webclient/

  3. Enter your user email and password

  4. You will see the remote desktop window, as shown below.

  5. Open the Firefox or Chrome browser. In a new tab, enter the TSL sequence service address http://sequences.tsl.ac.uk

  6. Enter your username and password. After a successful login, you will see a home page similar to the one below:

A list of projects is shown on the left-hand side. You may not see any projects listed if you are logging in for the first time.

Create a project

  1. Click the “New” button as shown above.

A form for the project details will be displayed.

  1. Give a proper project title. Avoid titles like project_1 or new_project_123. A title like “differential gene expression analysis in Arabidopsis thaliana” is a proper title.
  2. Select the project leader group you are in.
  3. Give a short description of the project.
  4. Give a long description of the project.
  5. Browse and upload any additional files you have relating to the project. If there are none, there is nothing to do.
  6. Tick the box for confirmation.
  7. Click “Create Project”. If everything is fine, it will create a project in the database. If anything is wrong, it will ask you to correct it.

Create Samples

After a project is created, the project page is displayed as below:

You are ready to add samples for the project you have just created.

  1. Click the “New” button to add a sample as shown above.
  2. Fill in the form for the sample:
     1. Give a name for the sample.
     2. Give the scientific name of the organism.
     3. Give the common name (if known) and the NCBI taxonomic ID of the organism. You can search for the organism on the NCBI taxonomy page to get its taxonomy ID.
     4. In Conditions, provide the conditions the organism was grown in, for example normal glasshouse conditions.
     5. If you have additional files relating to the sample, you can browse and upload them here. If not, do nothing.
  3. Tick the box to confirm.
  4. Click the “Create sample” button to create the sample.

The next page displayed will show the details of the sample you have just created.

At this stage, you can create more samples by clicking the project title.

You will then go back to the project details page, where the sample you have created is now listed.

To add a new sample, follow the same steps as before.

Let’s add a dataset to the sample now.

Add dataset to sample

  1. Click the sample name (in the project details page) to go to the sample details page. This is the same page that was displayed after you created the sample.
  2. Click “New” to display a form for a dataset of the sample. Fill in the details for the dataset. An example is shown below.

The fields sequencing technology, library source, library selection, library type and library strategy have dropdown lists, and you will need to select the appropriate option for your dataset. See what I have selected above.

Raw data selection for uploading

After you have filled in the form above, select the files in the Raw reads section. Depending on where your raw read data files are located, you have two choices: HPC upload or Local filesystem upload. The links to these choices are just below the Raw reads section title. By default, HPC upload is selected.

HPC Upload

This is a new feature added to the system. If your files are in the HPC directory /tsl/data/tempWebUploadToSequences, you can use this choice. It is the recommended choice because the files are simply copied rather than uploaded, which is very quick even for very large files. The limitation is that all files for a given Run submission must be selected from one directory.

Please read the points under “Please read carefully”, then tick the two check boxes to declare that you have read the instructions and that you are storing the files in the folder /tsl/data/tempWebUploadToSequences temporarily. Once you have ticked the two check boxes, you will see an empty input field that completes the path /tsl/data/tempWebUploadToSequences. For illustration purposes, I have entered the word 'cheese' into it.

Now prepare your files for uploading as follows:

  1. Create a folder (or folders) for your sample(s) in the directory /tsl/data/tempWebUploadToSequences. For example, if I am uploading reads for a sample named SampleA, I will create the folder /tsl/data/tempWebUploadToSequences/sampleA. You can log in to an HPC terminal and use the mkdir command to create the folder:
mkdir /tsl/data/tempWebUploadToSequences/sampleA
  2. Generate an MD5 checksum for each FASTQ raw data file of sampleA and copy the files to the folder you created above. You can generate the MD5 checksum for each file with the md5sum command:
md5sum /path/to/sampleA/sampleA_R1.fastq.gz

It will display the MD5 checksum like below:

274fd10aa73065e85d5672edcd07e80c  /path/to/sampleA/sampleA_R1.fastq.gz

Copy the FASTQ files to the folder.

cp /path/to/sampleA/sampleA_R1.fastq.gz /tsl/data/tempWebUploadToSequences/sampleA

If the data are paired-end, copy both the R1 and R2 files to the folder. If you have multiple sets of R1 and R2 FASTQ files for the sample, copy all of the files to the folder.

Now, type the folder name to search for your files. For example, I have named my folder sampleA, so I will search with that name and it will display all the files in the folder for my sample. See below:

Fill in the MD5 checksum for every file, using the values you generated with the md5sum command above. In the sibling column, choose the R2 FASTQ file for each R1 FASTQ file and vice versa. If you selected FASTQ-single in the form above, the sibling column is not shown.

If everything is fine, i.e. all files are displayed and you have filled in the correct MD5 checksum for each file and the siblings (for FASTQ-paired only), click “Validate and lock choices”. You can still revise your form at this point. If all looks fine, click “Lock choices”. After this, you cannot make changes to the form.

Below, you have the option to upload any additional files for the sample.

Now scroll down and tick the two checkboxes. Please read the text before you tick them.

Click the “Create run” button to upload the FASTQ files to sequences.tsl.ac.uk.

Local filesystem Upload

This is the same feature as before. You will need to browse the filesystem to select your raw FASTQ data files. Here, I have selected the library type FASTQ-paired in the form above, so it expects me to upload two files: forward (R1) and reverse (R2) sequences. If FASTQ-single was selected, it expects a single FASTQ file.

  1. Let’s upload the R1 FASTQ file first. Click the browse link and navigate to the HPC filesystem. My files are in the HPC dropbox area, so I start by typing \\tsl-hpc-data\HPC-Data as shown below:

  2. I have navigated to my folder \\tsl-hpc-data\HPC-Data\dropbox\ram (see above), where my test FASTQ files are located. Select the file and click “open” to upload the forward (R1) FASTQ file. Do the same to upload the reverse (R2) FASTQ file.

  3. You will also need to supply the MD5 checksum for the FASTQ files. Your sequencing provider will also give you the MD5 checksum for the files.

If you have them, copy and paste the MD5 checksums into the box labelled MD5 (required) for the respective files.

If you don’t have an MD5 checksum, you can generate it from the HPC command line using the command

md5sum /path/to/filename

There is also a link to generate the MD5 for the files. Be warned that if your file is corrupted, the MD5 generated will be different and will not represent the original file, so it is best to use the MD5 provided by your sequencing provider.

  1. If you have more datasets for the sample, click “+Add another”, which gives you the option to browse for more files to upload. Add the files and MD5 checksums in the same way as before. If there are no more datasets to add, you don’t need to click “+Add another”.

  2. If you have additional files relating to the dataset, you can browse for them and add them here. If not, do nothing.

  3. Tick the box to confirm and then click “Create run”. If the button is not clickable, there is something you have not added. Add all the details and try again.

Congratulations, you have successfully created a project and a sample, and uploaded the raw data files for the sample to the TSL sequences service.

Where can I find my uploaded files for analysis in HPC?

The raw datasets uploaded to TSL sequences can be accessed on the HPC for analysis. The file paths to the datasets have the following format:

/tsl/data/reads/{your_group}/{your_project}/{sample_name}/{run_name}/raw/{fastqfilename}

Notice that I mentioned your_group above. The following group directories exist in “/tsl/data/reads”, and each one is named after the username of your group leader.

  1. bioinformatics
  2. czipfel
  3. jjones
  4. maw
  5. mmoscou
  6. ntalbot
  7. skamoun
  8. two_blades
  9. jep23kod

For example, I have the following details:

project_name - differential_gene_expression_in_tomato
sample_name - sp2206
run_name - sp2206_001
fastqfiles - dataset_1.fastq.gz, dataset_2.fastq.gz

My group is bioinformatics, so after I have created the project and added the sample with the sample name and raw files given above, my files will be at the paths:

/tsl/data/reads/bioinformatics/differential_gene_expression_in_tomato/sp2206/sp2206_001/raw/dataset_1.fastq.gz
/tsl/data/reads/bioinformatics/differential_gene_expression_in_tomato/sp2206/sp2206_001/raw/dataset_2.fastq.gz
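In an analysis script, the folder holding the raw files of a run can be built from the same four parts of the path format. This is a small sketch; the function name reads_dir and its optional fifth argument (an alternative root directory) are hypothetical.

```shell
# Print the folder that holds the raw files of one run, following the
# format /tsl/data/reads/{group}/{project}/{sample}/{run}/raw
# Usage: reads_dir GROUP PROJECT SAMPLE RUN [ROOT]
reads_dir() {
    local root="${5:-/tsl/data/reads}"
    printf '%s/%s/%s/%s/%s/raw\n' "$root" "$1" "$2" "$3" "$4"
}

# Example: list the raw files of a run (replace the placeholders)
# ls -lh "$(reads_dir your_group your_project sample_name run_name)"
```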

Happy Data Analysis !!!